Foreword


A  joint-service  effort  is  in  progress  to  develop  a  computerized  adaptive  testing  (CAT)  system 
and  to  evaluate  its  potential  for  use  in  the  military  entrance  processing  stations  as  a  replacement  for 
the  paper-and-pencil  Armed  Services  Vocational  Aptitude  Battery  (ASVAB).  The  Department  of 
the  Navy  has  been  designated  as  lead  service  for  CAT  system  development  and  the  Navy  Personnel 
Research  and  Development  Center  has  been  designated  as  lead  laboratory. 

This research was funded under CAT-ASVAB Program Element 0604703N, Work Units
R1822-MH001 and R1822-MH001A, and reimbursable Navy funding (O&M,N), sponsored by the
Bureau of Naval Personnel (PERS-23).

This  research  was  part  of  the  overall  evaluation  of  CAT-ASVAB.  This  report  presents  the 
results  of  an  evaluation  of  the  effect  on  adaptive  scores  of  item  calibration  medium  of 
administration.  The  data  were  collected  by  RGI,  Inc.,  pursuant  to  contract  N66001-86-C-0217. 
Results  are  directed  toward  technical,  professional,  and  contractor  personnel  involved  in 
implementing  CAT. 


JOHN D. McAFEE                    RICHARD C. SORENSON
Captain, U.S. Navy                Technical Director (Acting)
Commanding Officer
Summary


Problem 

The  Navy  Personnel  Research  and  Development  Center  is  conducting  research  to  design  and 
evaluate  a  computerized  adaptive  test  (CAT)  as  a  potential  replacement  for  the  paper-and- 
pencil  (P&P)  Armed  Services  Vocational  Aptitude  Battery  (ASVAB).  In  support  of  this  effort,  the 
Accelerated CAT-ASVAB Program (ACAP) is evaluating item pools specifically developed for
computerized  adaptive  testing. 

An  important  question  in  the  development  of  item  pools  for  computerized  adaptive  tests  is 
whether data for calibrating items should be collected by a P&P or a computer administration of the
items.  If  P&P  administrations  do  not  yield  precise  enough  calibrations,  items  must  be  administered 
by  computer  for  calibration  just  as  they  are  during  testing.  Since  the  CAT-ASVAB  item  pools 
have  been  calibrated  using  P&P  administrations,  this  is  an  issue  of  interest  for  ACAP  research. 

Objective 

The  objective  of  this  study  was  to  evaluate  the  effect  on  adaptive  scores  of  using  a  P&P 
calibration.  Specifically,  to  what  extent  do  adaptive  scores  obtained  with  computer-administered 
items  and  a  P&P  calibration  correspond  to  adaptive  scores  obtained  with  computer-administered 
items  and  a  computer  calibration? 

Method 

Forty items from each of four ASVAB content areas (general science (GS), arithmetic
reasoning (AR), word knowledge (WK), and shop information (SI)) were administered by
computer  to  one  group  of  Navy  recruits  and  by  P&P  to  a  second  group.  These  data  were  used  to 
obtain  computer-based  and  P&P-based  calibrations  of  the  items.  Each  calibration  was  then  used 
to  estimate  item  response  theory  adaptive  scores  for  a  third  group  of  recruits  who  had  received  the 
items  by  computer.  The  effect  of  medium  of  administration  was  assessed  by  comparative  analyses 
of  the  scores  using  the  alternative  calibrations. 

Testing  was  conducted  at  the  Recruit  Training  Center  in  San  Diego,  CA.  ASVAB  scores  of 
record  were  obtained  for  nearly  all  of  the  recruits  and  were  used  to  assess  whether  the  groups  were 
comparable  in  ability  levels. 

Results  and  Discussion 

Results  of  the  reliability  analyses  indicate  that  random  errors  due  to  calibration  have  equivalent 
variance  across  different  media.  These  results  suggest  that  the  use  of  item  parameters  obtained  in 
a  P&P  calibration  will  not  affect  the  reliability  of  CAT-ASVAB  test  scores,  an  important  concern 
of  the  ACAP  program. 

Results  of  the  regression  and  correlation  analyses  show  statistically  significant  medium- 
of-administration  effects.  The  regression  results  showed  effects  on  AR,  WK,  and  SI;  and  the 
correlation  results  showed  effects  on  GS  and  WK. 
Conclusions


Results  of  regression,  correlation,  and  reliability  analyses  conducted  to  evaluate  calibration 
medium-of-administration  effect  on  adaptive  scores  indicate  that,  although  statistically  significant 
medium  effects  were  found  on  some  content  areas,  these  effects  did  not  affect  the  reliability  of  the 
CAT-ASVAB  scores. 

Recommendations

Although  these  findings  support  the  use  of  the  P&P  parameters  of  the  current  CAT-ASVAB 
item  pool,  further  hypothesis  testing  with  an  expanded  reliability  model  is  recommended  to 
elucidate the significant effects. In addition, analyses of individual item parameters may be
necessary  for  understanding  these  effects. 
Introduction


Background 

The  Navy  Personnel  Research  and  Development  Center  is  conducting  research  to  design  and 
evaluate  a  computerized  adaptive  test  (CAT)  as  a  potential  replacement  for  the  paper-and-pencil 
(P&P) Armed Services Vocational Aptitude Battery (ASVAB). In support of this effort, the
Accelerated  CAT-ASVAB  Program  (ACAP)  is  evaluating  item  pools  specifically  developed  for 
computerized  adaptive  testing. 

An important question in the development of item pools for computerized adaptive tests is
whether  data  for  calibrating  items  should  be  collected  by  a  P&P  or  a  computer  administration  of  the 
items.  Although  research  shows  that  computerized  adaptive  tests  with  P&P  item  calibrations  can 
have  validities  comparable  to  conventional  P&P  tests  (Moreno,  Segall,  &  Kieckhaefer,  1985, 
pp.  29-33),  how  much  less  than  optimal  these  computerized  adaptive  tests  might  be  is  unknown. 

The  concern  about  medium  of  administration  in  item  calibration  is  that  item  parameters  for 
some  types  of  items  (e.g.,  items  with  long  paragraphs  or  with  graphics)  may  differ  between 
computer  and  P&P  administrations.  This  could  result  in  less-than-optimal  item  selection  and  score 
estimation  in  adaptive  tests.  If  P&P  administrations  do  not  yield  precise  enough  calibrations,  items 
must  be  administered  by  computer  during  calibration  just  as  they  are  during  testing. 

Objective 

The  objective  of  this  study  was  to  evaluate  the  effect  on  adaptive  scores  of  using  a  P&P 
calibration. Specifically, to what extent do adaptive scores obtained with computer-administered
items  and  a  P&P  calibration  correspond  to  adaptive  scores  obtained  with  computer-administered 
items  and  a  computer  calibration? 


Method 

Fixed  blocks  of  items  were  administered  by  computer  to  one  group  of  examinees  and  by  P&P  to 
a  second  group.  These  data  were  used  to  obtain  computer-based  and  P&P-based  calibrations  of  the 
items.  Each  calibration  was  then  used  to  estimate  item  response  theory  (IRT)  adaptive  scores 
(thetas) for a third group of examinees who had received the items by computer. The effect of
medium  of  administration  was  assessed  by  comparative  analyses  of  the  thetas  using  the  alternative 
calibrations. 

Subjects 

The  subjects  were  Navy  recruits  who  were  randomly  assigned  to  one  of  three  groups.  Data  were 
collected  for  2,955  examinees  with  989  in  Group  1  (computer),  978  in  Group  2  (P&P),  and  988  in 
Group  3  (computer).  These  sample  sizes  provide  enough  data  for  independent  calibrations,  since 
simulation results obtained by Hulin, Drasgow, and Parsons (1983, pp. 101-110) suggest that
substantially  larger  samples  produce  little  improvement  in  the  precision  of  item  characteristic 
curves  and  scores,  given  the  number  of  items  (40)  used  in  these  calibrations. 
Testing was conducted at a Recruit Training Center in San Diego, CA. ASVAB scores of record
were  obtained  for  nearly  all  of  the  recruits  and  were  used  to  assess  whether  the  groups  were 
comparable  in  ability  levels. 

Items 

The  items  were  taken  from  pools  specifically  developed  in  support  of  CAT-ASVAB  by 
Prestwood,  Vale,  Massey,  and  Welsh  (1985).  Forty  items  from  each  of  four  ASVAB  content  areas 
(general  science,  arithmetic  reasoning,  word  knowledge,  and  shop  information)  were 
administered by computer to Groups 1 and 3, and by P&P to Group 2. The items were
conventionally  administered  in  ascending  order  of  difficulty,  using  the  difficulties  obtained  by 
Prestwood  et  al.  (1985).  The  three  groups  received  the  same  items  with  the  same  instructions  and 
practice  problems,  in  the  same  order  and  with  the  same  time  limits.  Although  only  4  of  the  11 
CAT-ASVAB  content  areas  were  included  in  this  study,  the  medium-of-administration  (MOA) 
subtests were administered in the same order as in the CAT-ASVAB. Time limits were prorated
from  95%  completion  times  for  the  same  content  areas  in  ACAP,  with  10%  added  to  allow  for  a 
higher  completion  rate.  Subtest  order  and  time  limits  are  shown  in  Table  1. 

Table 1

Medium-of-Administration Subtest Order and Time Limits

Subtest                       Time (Minutes)
General Science (GS)                19
Arithmetic Reasoning (AR)           63
Word Knowledge (WK)                 16
Shop Information (SI)               17
Total                              115

The  40  items  included  34  high-usage  items  (usage  obtained  from  ACAP  simulation  studies) 
and  six  “seeds”  (not-scored  items  administered  for  the  purpose  of  gathering  data  for  on-line 
calibration  research).  The  booklet  format  was  the  same  as  that  used  in  the  original  P&P 
calibration  by  Prestwood  et  al.  (1985),  and  the  computer  format  was  the  same  as  that  used  in 
ACAP.  Practice  problems  and  instructions  were  also  as  in  ACAP. 

Item  Calibrations 

IRT parameter estimates based on the three-parameter logistic model (Birnbaum, 1968) were
obtained in separate calibrations for each of the two computer groups (calibrations C1 and C3)
and for the P&P group (calibration C2). The data sets on which the calibrations are based are
labelled U1, U3, and U2, correspondingly. The calibrations were performed with LOGIST6
(Wingersky,  Barton,  &  Lord,  1982),  a  computer  program  that  uses  a  joint  maximum-likelihood 
approach.  The  design  with  the  corresponding  notation  is  summarized  in  Table  2. 
Table 2

Calibration Design

Group No.   Medium     Data Set/          Item Parameters/
                       Item Responses     Calibrations
1           Computer   U1                 C1
2           P&P        U2                 C2
3           Computer   U3                 C3

Note. P&P = paper and pencil.
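
The three-parameter logistic model named under Item Calibrations can be written as a short function. The sketch below is illustrative only: the scaling constant D = 1.7 is the conventional choice and is assumed here, and the item parameter values are hypothetical rather than taken from the C1, C2, or C3 calibrations.

```python
import math

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the three-parameter logistic model.

    theta: examinee ability; a: discrimination; b: difficulty; c: lower asymptote.
    D is the conventional scaling constant (assumed, not stated in the report).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical item parameters, for illustration only.
p_at_difficulty = p_correct_3pl(theta=0.0, a=1.2, b=0.0, c=0.2)
p_low_ability = p_correct_3pl(theta=-3.0, a=1.2, b=0.0, c=0.2)
print(p_at_difficulty, p_low_ability)
```

When theta equals b, the probability is halfway between c and 1; for very low ability it approaches the lower asymptote c.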


Scores 

For each recruit in Group 3, three theta scores were computed: T1, T2, and T3 (see Table 3). All
three scores were based on U3 responses. T1 scores were calculated from the computer-
administration item parameters (C1). T2 scores were based on the P&P-administration parameters
(C2), and T3 scores were calculated from the third parameter set (C3), also based on a computer
administration. As described below, T1 and T2 were adaptive scores, using only 10 of the 40
responses from a given examinee, while the T3 theta was nonadaptive, using all 40 responses.

Table 3

Computation of Theta Scores

Calibration Parameters     Response Set   Scoring Method   Test Length   Theta
C1 (Group 1, computer)     U3             Adaptive         10 items      T1
C2 (Group 2, P&P)          U3             Adaptive         10 items      T2
C3 (Group 3, computer)     U3             Nonadaptive      40 items      T3

Note. P&P = paper and pencil.


Adaptive  Scores 

To compute the adaptive thetas (T1 and T2), 10-item adaptive tests were simulated using actual
examinee responses. Owen's Bayesian scoring (Owen, 1975) was used throughout the test to update
the ability estimate, and a Bayesian modal estimate was computed at the end of the test to obtain the
final score. Items were selected from information tables on the basis of maximum information. (An
information table consists of lists of items by ability level. Within each list, all the items in the
pool (40 in this case) are arranged in descending order of the values of their information functions
computed at that ability level. The information tables used in this study were computed for 37
ability levels equally spaced along the [-2.25, +2.25] interval.)
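
The information-table lookup described above can be sketched as follows. This is a minimal illustration, not the ACAP software: the three-item pool, the scaling constant D = 1.7, and the function names are assumptions, and the ability update between items (Owen's Bayesian scoring) is not shown.

```python
import math

D = 1.7  # conventional logistic scaling constant (assumed)

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def build_information_table(items, levels):
    """For each ability level, list item indices in descending order of information."""
    return [
        sorted(range(len(items)), key=lambda i: -item_information(t, *items[i]))
        for t in levels
    ]

def select_next_item(table, levels, theta_hat, used):
    """Return the most informative unused item at the level nearest theta_hat."""
    k = min(range(len(levels)), key=lambda j: abs(levels[j] - theta_hat))
    return next(i for i in table[k] if i not in used)

# 37 ability levels equally spaced on [-2.25, +2.25], as in the report.
levels = [-2.25 + 0.125 * j for j in range(37)]
# Hypothetical (a, b, c) parameters; the study pools held 40 items per subtest.
items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
table = build_information_table(items, levels)
first = select_next_item(table, levels, theta_hat=0.0, used=set())
second = select_next_item(table, levels, theta_hat=0.0, used={first})
print(first, second)
```

Precomputing the sorted lists means that item selection during a test reduces to a table lookup plus a scan for the first unused item.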

Nonadaptive  Scores 

The nonadaptive thetas (T3) included all 40 items in the test. Final thetas were computed
using  the  Bayesian  modal  estimate  (since  all  the  items  go  into  the  score,  it  is  not  necessary  to 
update  the  ability  estimate  after  each  item). 
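
A Bayesian modal estimate is the ability value that maximizes the posterior density given the full response vector. The sketch below finds it by grid search; the standard normal prior, the grid bounds, the scaling constant D = 1.7, and the item parameters are all assumptions made for illustration, since the report does not state them.

```python
import math

D = 1.7  # conventional logistic scaling constant (assumed)

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_posterior(theta, responses, items, prior_mean=0.0, prior_sd=1.0):
    """Log of (normal prior x 3PL likelihood), up to an additive constant."""
    lp = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
    for u, (a, b, c) in zip(responses, items):
        p = p3pl(theta, a, b, c)
        lp += math.log(p) if u == 1 else math.log(1.0 - p)
    return lp

def bayes_modal_theta(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Grid-search approximation to the posterior mode of theta."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda t: log_posterior(t, responses, items))

# Hypothetical (a, b, c) parameters for a four-item test.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.2), (1.1, 1.0, 0.2)]
theta_all_right = bayes_modal_theta([1, 1, 1, 1], items)
theta_all_wrong = bayes_modal_theta([0, 0, 0, 0], items)
print(theta_all_right, theta_all_wrong)
```

Because every response enters the posterior at once, no item-by-item updating is needed, which matches the remark in the text about the nonadaptive scores.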
Table 3 summarizes the method used for computing the theta scores used in the analyses.

ASVAB Scores


ASVAB  subtest  scores  for  the  four  content  areas  of  interest  and  the  Armed  Forces  Qualification 
Test  (AFQT)  were  obtained  from  the  records  for  most  of  the  examinees.  The  subtests  were  General 
Science (GS), Arithmetic Reasoning (AR), Word Knowledge (WK), and Auto Shop (AS). Notice that
the ASVAB's Auto Shop subtest covers two content areas: auto information and shop information,
whereas in the CAT-ASVAB each area constitutes a separate subtest. Since only shop information was
administered in this study, ASVAB-AS was compared to MOA-SI.

Results  and  Discussion 


Calibration  Samples 

Two subjects in Group 3 (computer) had fewer than 10 valid responses on subtests WK and SI, and
LOGIST omitted them from the calibrations. These subjects were eliminated from all subsequent
analyses of WK and SI, Group 3 (computer). Final sample sizes were 989 for Group 1 (computer), 978
for Group 2 (P&P), 988 for GS and AR in Group 3 (computer), and 986 for WK and SI in Group 3
(computer).

AFQT  Comparisons 

To  determine  whether  the  three  calibration  groups  were  comparable  in  examinee  ability,  a  one-way 
analysis  of  variance  of  AFQT  by  calibration  group  was  computed.  Results  (Table  4)  clearly  indicate 
that there are no AFQT differences among the three groups. Sample sizes are slightly smaller in
Tables 4 through 7 because AFQT scores were not available for some examinees.

Table 4

Analysis of Variance: AFQT by Calibration Group

Source            df    Sum of Squares   Mean Squares   F-Ratio   F-Prob.
Between Groups     2              29.8        14.9199     0.035    0.9656
Within Groups   2923         1247241.6       426.6990
Total           2925         1247271.4

Group No.   Count      Mean   Standard Deviation   Standard Error
1             985   55.5655              21.0157           0.6696
2             963   55.3946              20.4010           0.6574
3             978   55.3221              20.5418           0.6569
Total        2926   55.4279              20.6499           0.3818
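
The F-ratio and its probability in Table 4 can be checked from the sums of squares. The sketch below uses a closed form for the upper tail of the F distribution that holds only when the numerator has exactly 2 degrees of freedom, as it does here; the function names are illustrative.

```python
def one_way_f(ss_between, df_between, ss_within, df_within):
    """F-ratio as the ratio of between-group to within-group mean squares."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within

def f_upper_tail_df1_two(f, df_within):
    """P(F > f) when the numerator df is exactly 2 (closed form for that case only)."""
    return (1.0 + 2.0 * f / df_within) ** (-df_within / 2.0)

# Sums of squares and degrees of freedom as reported in Table 4.
f_ratio = one_way_f(29.8, 2, 1247241.6, 2923)
p_value = f_upper_tail_df1_two(f_ratio, 2923)
print(round(f_ratio, 3), round(p_value, 3))
```

Both quantities agree with the tabled values to the precision reported, which supports the reading that the three groups do not differ on AFQT.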

Number-Right  Scores 

Tables  5,  6,  and  7  show  means,  standard  deviations,  and  intercorrelations  of  number-right 
scores  for  same-name  ASVAB  and  MOA  subtests  by  calibration  group. 

Table 5

ASVAB vs. Group 1 (Computer): Number-Right Score
Means, Standard Deviations, and Intercorrelations
(N = 985)

                       P&P ASVAB                      Group 1 (Computer)
Subtest        GS     AR     WK     AS   AFQT     GS     AR     WK     SI

Correlation Matrix

P&P ASVAB
GS           1.00
AR           0.57   1.00
WK           0.73   0.52   1.00
AS           0.50   0.42   0.48   1.00
AFQT         0.69   0.85   0.78   0.45   1.00
Group 1 (Computer)
GS           0.81   0.58   0.73   0.53   0.71   1.00
AR           0.49   0.73   0.46   0.35   0.71   0.57   1.00
WK           0.69   0.47   0.77   0.43   0.67   0.72   0.46   1.00
SI           0.54   0.44   0.49   0.78   0.46   0.60   0.41   0.48   1.00

Means and Standard Deviations

Min          4.00   7.00    .00   0.00  21.00   8.00   6.00   6.00   6.00
Max         25.00  30.00  35.00  25.00  99.00  39.00  40.00  39.00  39.00
Mean        17.55  20.22  26.85  16.14  55.57  25.66  20.06  22.82  22.62
SD           4.44   5.80   5.41   5.16  21.01   6.23   5.96   5.13   6.88

Note. ASVAB = Armed Services Vocational Aptitude Battery, P&P = paper and pencil, GS = General Science, AR = Arithmetic
Reasoning, WK = Word Knowledge, AS = Auto Shop, AFQT = Armed Forces Qualification Test, SI = Shop Information.
Table 6

ASVAB vs. Group 2 (P&P): Number-Right Score
Means, Standard Deviations, and Intercorrelations
(N = 963)

                       P&P ASVAB                      Group 2 (P&P)
Subtest        GS     AR     WK     AS   AFQT     GS     AR     WK     SI

Correlation Matrix

P&P ASVAB
GS           1.00
AR           0.50   1.00
WK           0.74   0.48   1.00
AS           0.52   0.39   0.46   1.00
AFQT         0.69   0.82   0.78   0.42   1.00
Group 2 (P&P)
GS           0.80   0.54   0.76   0.51   0.72   1.00
AR           0.45   0.72   0.44   0.30   0.68   0.54   1.00
WK           0.69   0.45   0.79   0.38   0.68   0.74   0.44   1.00
SI           0.53   0.41   0.50   0.77   0.46   0.56   0.36   0.43   1.00

Means and Standard Deviations

Min          5.00   5.00  10.00   3.00  17.00   8.00   6.00   5.00   3.00
Max         25.00  30.00  35.00  25.00  99.00  39.00  39.00  40.00  40.00
Mean        17.44  20.41  26.68  16.16  55.39  25.33  20.22  22.74  22.96
SD           4.38   5.52   5.32   5.02  20.40   6.18   5.81   5.28   6.84

Note. See Table 5 for definitions of acronyms.
Table 7

ASVAB vs. Group 3 (Computer): Number-Right Score
Means, Standard Deviations, and Intercorrelations
(N = 978)

                       P&P ASVAB                      Group 3 (Computer)
Subtest        GS     AR     WK     AS   AFQT     GS     AR     WK     SI

Correlation Matrix

P&P ASVAB
GS           1.00
AR           0.48   1.00
WK           0.74   0.44   1.00
AS           0.47   0.35   0.45   1.00
AFQT         0.66   0.83   0.75   0.40   1.00
Group 3 (Computer)
GS           0.81   0.51   0.75   0.52   0.69   1.00
AR           0.46   0.74   0.42   0.31   0.70   0.53   1.00
WK           0.70   0.40   0.79   0.38   0.66   0.73   0.45   1.00
SI           0.52   0.38   0.49   0.76   0.45   0.60   0.36   0.48   1.00

Means and Standard Deviations

Min          4.00   5.00   6.00   0.00  20.00   6.00   5.00   5.00   5.00
Max         25.00  30.00  35.00  25.00  99.00  40.00  40.00  40.00  39.00
Mean        17.42  20.14  26.84  16.14  55.32  25.70  19.70  22.66  22.56
SD           4.47   5.67   5.48   4.93  20.54   6.17   5.94   5.22   7.03

Note. See Table 5 for definitions of acronyms.


Regression Analysis

This  analysis  was  designed  to  test  whether  10-item  adaptive  ability  scores  computed  using 
computer  or  P&P  calibrated  items  are  equivalent.  For  each  of  the  four  content  areas,  the  appendix 
presents the regressions and scatter plots of T1 on T3, and of T2 on T3, where T1, T2, and T3 are
as defined in Table 3. Then, a LISREL-based (Jöreskog & Sörbom, 1986) analysis was designed
to  test  for  the  equality  of  the  regression  lines  of  T1  on  T3  and  of  T2  on  T3.  The  testing  was 
sequential,  first  for  equality  of  slopes  and  then  for  equality  of  intercepts  (if  the  slopes  are  different, 
testing  for  equality  of  intercepts  is  not  required). 
To test for equality of slopes, a LISREL run that tested for equality of covariances was
performed for each of the four content areas. The model specification for LISREL was as follows:

PHI = covariance matrix, free
Lambda-X = identity matrix
Theta-Delta = error matrix, zero
COV(T1,T3) = COV(T2,T3)

Results  are  presented  in  Table  8.  Significant  differences  were  obtained  for  all  content  areas 
except  GS. 


Table 8

Test for Equality of Covariances:
COV(T1,T3) = COV(T2,T3)

Subtest    ChiSq    df   p      GoF    AdjGoF   RMSR
GS           .02     1   .876   .999     .995   .001
AR         10.92*    1   .001   .644   -1.137   .011
WK         21.38*    1   .000   .495   -2.028   .014
SI         12.13*    1   .000   .790   -0.261   .016

Notes. 1. p = probability, GoF = Goodness of Fit, RMSR = Root Mean Square Residuals.
2. See Table 5 for definitions of other acronyms.
*p < .05.


Correlation  Analysis 

After the results of the regression analyses were obtained, it was decided to use directional
hypotheses in an attempt to explain the differences found. For each of the four content areas, the
Pearson correlations, r(T1 on T3) and r(T2 on T3), were obtained, and t-tests of the difference between
dependent correlations (Cohen & Cohen, 1975, p. 53) were computed. Table 9 presents these results.
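
The t-test for two dependent correlations that share a variable (here T3) can be sketched with Hotelling's formula. It is an assumption that this is the exact form used in the study, although it appears to reproduce the WK and SI values in Table 9 from the tabled correlations; the function name is illustrative.

```python
import math

def dependent_r_t(r12, r13, r23, n):
    """t-test of r13 versus r23 when both correlations share variable 3.

    Hotelling's formula, assumed here to be the form used in the report.
    Returns the t statistic and its degrees of freedom (n - 3).
    """
    # Determinant of the 3 x 3 correlation matrix of the three scores.
    det = 1.0 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2.0 * r12 * r13 * r23
    t = (r13 - r23) * math.sqrt((n - 3) * (1.0 + r12) / (2.0 * det))
    return t, n - 3

# Correlations and sample sizes for WK and SI as reported in Table 9.
t_wk, df_wk = dependent_r_t(r12=0.9798, r13=0.9665, r23=0.9632, n=986)
t_si, df_si = dependent_r_t(r12=0.9564, r13=0.9508, r23=0.9507, n=986)
print(round(t_wk, 3), df_wk, round(t_si, 3), df_si)
```

Because r(T1,T2) is close to one, the determinant is small, so even a difference of 0.003 between the two correlations yields a sizable t for WK.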

For  subtests  GS  and  WK,  when  correlations  were  computed  between  10-item  adaptive  and 
40-item nonadaptive thetas, r(T1,T3) (the computer-computer correlation) was significantly higher
than r(T2,T3) (the P&P-computer correlation). This is consistent with a hypothesis that thetas based
on the same calibration medium of administration are more similar than thetas based on different
media of administration. However, without further analysis, it is not clear why these results were
obtained for GS and WK and not for AR and SI.

Reliability Analysis

Since the results from the regression and correlation analyses are conflicting and difficult to
interpret, further analyses were required. A design was developed to assess the effect of calibration
medium on test reliabilities. The model and the LISREL specifications are described below. These tests
assess overall effect across the four content areas simultaneously; if a significant effect is found, further
analyses would be required to attribute the error to specific subtests.
Table 9

t-Test of the Difference Between Dependent Correlations
H0: r13 > r23

Subtest   N(a)   r(T1,T2)   r(T1,T3)   r(T2,T3)    df         t
GS         988         --         --         --   985   2.768**
AR         988     0.9586         --         --   985    -0.060
WK         986     0.9798     0.9665     0.9632   983    2.114*
SI         986     0.9564     0.9508     0.9507   983     0.039

Notes. 1. r = Pearson correlation coefficient; T1 = Calibration Group 1 (computer), 10-item adaptive
theta; T2 = Calibration Group 2 (P&P), 10-item adaptive theta; T3 = Calibration Group 3 (computer),
40-item nonadaptive theta.
2. See Table 5 for definitions of other acronyms.
(a) All responses from Group 3 (computer) (U3).
**p < .01 (one tailed).
*p < .05 (one tailed).

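Statistical Model

Assume that the observed theta values, $\hat{\theta}$, have three components: the true ability level $\theta$,
measurement error $\varepsilon$, and random error due to calibration $\delta$. Then,

$$\hat{\theta} = \lambda(\theta + \varepsilon) + \delta$$
$$\hat{\theta} = \lambda\xi + \delta$$

where $\xi = \theta + \varepsilon$, the true ability plus the error of measurement, and $\lambda$ is a scale factor. Then, the
basic measurement model can be described by the following eight equations:

Equation                                          Subtest Score   Responses   Item Parameters
$\hat{\theta}_1 = \lambda_1 \xi_1 + \delta_1$     T1-GS           U3          Computer
$\hat{\theta}_2 = \lambda_2 \xi_2 + \delta_2$     T1-AR           U3          Computer
$\hat{\theta}_3 = \lambda_3 \xi_3 + \delta_3$     T1-WK           U3          Computer
$\hat{\theta}_4 = \lambda_4 \xi_4 + \delta_4$     T1-SI           U3          Computer
$\hat{\theta}_5 = \lambda_5 \xi_5 + \delta_5$     T2-GS           U3          P&P
$\hat{\theta}_6 = \lambda_6 \xi_6 + \delta_6$     T2-AR           U3          P&P
$\hat{\theta}_7 = \lambda_7 \xi_7 + \delta_7$     T2-WK           U3          P&P
$\hat{\theta}_8 = \lambda_8 \xi_8 + \delta_8$     T2-SI           U3          P&P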
Selecting the best-fitting model consists of: (1) estimating the model in which certain
parameters are set to be equal, (2) estimating a less constrained model, and (3) assessing the
statistical significance of the improvement in fit going from the more constrained model to the
less constrained model. If the more constrained model fits the data as well as the less constrained
model (i.e., within sampling error limits), then one may conclude that the constraints do not
seriously erode the fit of the model.

In this case, one model is specified such that the calibration errors of the pseudo-true test
scores are constrained to be equal for the computer-based and the P&P item parameters; another
model  is  specified  such  that  these  calibration  errors  are  free  to  vary  between  the  two  media  of 
administration.  If  the  constrained  model  provides  just  as  good  a  fit  as  the  free  model,  then 
constraining  calibration  errors  to  be  equal  across  item-parameter  sets  does  not  erode  the  fit  of  the 
model  to  the  data,  and  one  may  conclude  that  the  calibration  errors  of  the  ability  scores  are  equal 
for  computer  and  P&P  item-parameters. 

According to the model, the variance-covariance matrix $\Sigma$ among the observed scores has the
form:

$$\Sigma = \Lambda \Phi \Lambda' + \Theta_\delta$$

where $\Lambda$ is a diagonal matrix with standard deviations in the diagonal, $\Theta_\delta$ is a diagonal matrix of
variances attributable to calibration error, and $\Phi$ is the attenuated correlation matrix among the
ability values. Notice that the matrix $\Phi$ is attenuated from only one source of error, the source
attributable to the calibration; $\Phi$ is not attenuated with respect to measurement error. The fixed and
estimated parameters of this model are displayed in Table 10.
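
To make the structure of the model-implied covariance matrix concrete, the sketch below evaluates the standard LISREL factor-model form, Sigma = Lambda Phi Lambda' + Theta-Delta, for diagonal Lambda and Theta-Delta. It assumes that this standard form is the one intended, and the numerical values are hypothetical: two scores for the same subtest, one per calibration medium, with a correlation of one between them and equal calibration-error variances.

```python
def implied_covariance(lam, phi, theta_delta):
    """Sigma = Lambda * Phi * Lambda' + Theta_delta.

    lam: diagonal of Lambda (scale factors); phi: correlation matrix;
    theta_delta: diagonal of Theta_delta (calibration-error variances).
    """
    n = len(lam)
    return [
        [
            lam[i] * phi[i][j] * lam[j] + (theta_delta[i] if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]

# Hypothetical values for two scores on one subtest (computer and P&P calibrations).
lam = [0.9, 0.8]
phi = [[1.0, 1.0], [1.0, 1.0]]
theta_delta = [0.05, 0.05]
sigma = implied_covariance(lam, phi, theta_delta)
print(sigma)
```

The calibration-error variances enter only on the diagonal, so constraining them to be equal across media changes the implied variances but not the implied covariances.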

Table 10

Correlation and Variance Matrices in the Model

Correlation Matrix $\Phi$

Subtest Score                 (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)
                             T1-GS   T1-AR   T1-WK   T1-SI   T2-GS   T2-AR   T2-WK   T2-SI
$\hat{\theta}_1$ (T1-GS)      1.0
$\hat{\theta}_2$ (T1-AR)      r21     1.0
$\hat{\theta}_3$ (T1-WK)      r31     r32     1.0
$\hat{\theta}_4$ (T1-SI)      r41     r42     r43     1.0
$\hat{\theta}_5$ (T2-GS)      1.0     r21     r31     r41     1.0
$\hat{\theta}_6$ (T2-AR)      r21     1.0     r32     r42     r21     1.0
$\hat{\theta}_7$ (T2-WK)      r31     r32     1.0     r43     r31     r32     1.0
$\hat{\theta}_8$ (T2-SI)      r41     r42     r43     1.0     r41     r42     r43     1.0

Variance Matrix $\Theta_\delta$ (diagonal)

                              r11     r22     r33     r44     r55     r66     r77     r88

Note. rjk denotes $r(\hat{\theta}_j, \hat{\theta}_k)$; columns (1) through (8) correspond to $\hat{\theta}_1$ through $\hat{\theta}_8$.
Notice in Table 10 that the correlation of a subtest with itself (across media) is equal to one,
and that the correlations between same-name subtests are assumed to be equal both across and
within calibration media. For example, the correlation between P&P T2-AR and T2-GS
should be represented by $r(\hat{\theta}_6, \hat{\theta}_5)$; however, it is represented by $r(\hat{\theta}_2, \hat{\theta}_1)$ because under the
model all the correlations between GS and AR are assumed to be equal; that is,

$$r(\hat{\theta}_2, \hat{\theta}_1) = r(\hat{\theta}_6, \hat{\theta}_1) = r(\hat{\theta}_2, \hat{\theta}_5) = r(\hat{\theta}_6, \hat{\theta}_5).$$

LISREL  Models 

To  test  the  model  fit,  two  models  were  specified  and  corresponding  LISREL  runs  were 
performed. In Model 1, the variances of errors due to calibration ($\delta_i$) were free to vary between
the  two  media  of  administration.  In  Model  2,  the  variances  of  errors  due  to  calibration  were 
constrained  to  be  equal  for  same-name  subtests. 

The $\Phi$ constraints in Table 10 were imposed for both Model 1 and Model 2.

The LISREL output yields a chi-square statistic that is a measure of how much $\Sigma$ differs from
S; that is, how well the model fits the data. The difference in the chi-squares from the two models
is also a chi-square with df equal to the difference in df from the two models. If this difference is
not significant, then the data fit the model independently of the calibration errors; that is,
errors due to calibration across media for same-name subtests are equal.

The  LISREL  specifications  for  Model  1  were: 

1.  Lambda-X  =  Diagonal  Matrix,  Free. 

2.  PHI  =  Symmetrical  Matrix,  Free. 

3.  Theta-Delta  =  Diagonal  Matrix,  Free. 

The  LISREL  specifications  for  Model  2  were: 

1. Lambda-X = Diagonal Matrix, Free.

2. PHI = Symmetrical Matrix, Free.

3. Theta-Delta (TD) = Diagonal Matrix with Constraints.

4. TD Constraint No. 1: All off-diagonal $\theta_\delta$ fixed at zero.

5. TD Constraint No. 2: $\theta_\delta$ (computer) = $\theta_\delta$ (P&P); that is,

$\theta_\delta(1,1) = \theta_\delta(5,5)$

$\theta_\delta(2,2) = \theta_\delta(6,6)$

$\theta_\delta(3,3) = \theta_\delta(7,7)$

$\theta_\delta(4,4) = \theta_\delta(8,8)$.

Table 11 presents goodness-of-fit statistics for these models. The likelihood ratio chi-square
value of the model in Model 1 was 14.07 with 14 df. The result is not statistically significant,
indicating  that  Model  1  adequately  explains  the  observed  covariance  matrices.  Results  for 
Model  2  show  a  chi-square  value  of  19.57  with  18  df  which  is  also  not  statistically  significant. 
The difference in chi-squares between Model 1 and Model 2 is distributed as a chi-square with
df equal to the difference in df from Model 1 and Model 2. This value (5.50 with 4 df) was not
significant, indicating that allowing the error term to be free does not change the fit of the model.
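
The chi-square difference test can be worked through numerically. The sketch below uses a closed form for the upper-tail chi-square probability that is valid only for even degrees of freedom, which covers the 4 df of the difference reported here; the function name is illustrative, and the chi-square values are those reported for Model 1 and Model 2.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(chi-square > x); closed form valid for positive even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j  # builds (x/2)^j / j! incrementally
        total += term
    return math.exp(-half) * total

# Reported fit statistics: Model 1 (free) and Model 2 (constrained).
delta_chi2 = 19.57 - 14.07
delta_df = 18 - 14
p_diff = chi2_upper_tail_even_df(delta_chi2, delta_df)
print(round(delta_chi2, 2), delta_df, round(p_diff, 4))
```

The resulting probability is well above .05, consistent with the report's statement that the difference was not significant.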

Table 11

Test for Equality of Reliabilities

Model No.   Specification                                      ChiSq   df   p      GoF     AdjGoF   RMSR
1           $\theta_\delta$'s: free                            14.07   14   n.s.   0.996   0.991    0.002
2           $\theta_\delta$ (Computer) = $\theta_\delta$ (P&P)  19.57   18   n.s.   0.995   0.990    0.003
3           Model 2 - Model 1                                   5.50    4   n.s.

Note. p = probability, GoF = Goodness of Fit, RMSR = Root Mean Square Residuals, n.s. = not significant.


Conclusions 

Results  of  the  regression,  correlation,  and  reliability  analyses  indicate  that,  although 
statistically  significant  medium  effects  were  found  for  some  content  areas,  these  effects  did  not 
affect  the  reliability  of  adaptive  scores. 

Results  of  the  reliability  analyses  indicate  that  random  errors  due  to  calibration  have 
equivalent  variance  across  different  media.  This  suggests  that  the  use  of  item  parameters  obtained 
in  a  P&P  calibration  will  not  affect  the  reliability  of  CAT-ASVAB  test  scores,  an  important 
concern  of  the  ACAP  program. 

The  regression  and  correlation  results  showed  significant  medium-of-administration  effects. 
The  regression  results  showed  effects  on  AR,  WK,  and  SI,  and  the  correlation  results  showed 
effects  on  GS  and  WK.  Because  these  results  can  be  attributed  to  the  effects  of  calibration 
medium on the scale factor $\lambda$ in the reliability analysis, further hypothesis testing with the
reliability  model  is  necessary  to  elucidate  the  scale  effects.  Analyses  of  individual  item 
parameters  may  also  be  necessary  for  understanding  these  effects.  In  addition,  the  results  may  be 
clarified by alternative treatment of not-reached items when the thetas (T1, T2, and T3) are being
computed. 


Recommendation 

Although  these  findings  support  the  use  of  the  P&P  parameters  of  the  current  CAT-ASVAB 
item  pool,  further  hypothesis  testing  with  an  expanded  reliability  model  is  recommended  to 
elucidate  the  significant  effects.  In  addition,  analyses  of  individual  item  parameters  may  be 
necessary  for  understanding  these  effects. 
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Appendix 

Regressions,  Correlations,  and  Scatter  Plots 


GENERAL  SCIENCE 


T1 = ADAPTIVE THETA (C1, U3), 10-items
T2 = ADAPTIVE THETA (C2, U3), 10-items
T3 = NON-ADAPTIVE THETA (C3, U3), 40-items

Analysis for 988 points of 2 variables:

Variable         T2         T1
Min         -2.5830    -2.6970
Max          2.5110     2.5660
Sum         66.6930    23.7240
Mean         0.0675     0.0240
SD           0.8631     0.8570

Correlation Matrix:

T2           1.0000
T1           0.9703     1.0000
Variable         T2         T1


GENERAL  SCIENCE 


T1 = ADAPTIVE THETA (C1, U3), 10-items
T2 = ADAPTIVE THETA (C2, U3), 10-items
T3 = NON-ADAPTIVE THETA (C3, U3), 40-items

Analysis for 988 points of 2 variables:

Variable         T1         T3
Min         -2.6970    -3.0300
Max          2.5660     2.5710
Sum         23.7240     2.4860
Mean         0.0240     0.0025
SD           0.8570     0.9232

Correlation Matrix:

T1           1.0000
T3           0.9608     1.0000
Variable         T1         T3

Regression Equation for T1:

T1 = 0.8919 T3 + 0.0217679

Significance test for prediction of T1
Mult-R    R-Squared    F(1,986)      prob(F)
0.9608    0.9232       11856.9617    0.0000
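
The regression output above is consistent with the summary statistics printed with it. As a rough check, and assuming the printout comes from an ordinary least-squares fit with one predictor (the report does not name the software), the slope, intercept, R-squared, and F statistic can be recovered from the means, standard deviations, and correlation alone. The function name below is hypothetical.

```python
def regression_from_summary(mean_y, sd_y, mean_x, sd_x, r, n):
    """Simple-regression quantities implied by summary statistics.

    Assumes an ordinary least-squares fit of y on a single predictor x.
    """
    slope = r * sd_y / sd_x                    # b = r * (SD_y / SD_x)
    intercept = mean_y - slope * mean_x        # a = mean_y - b * mean_x
    r_squared = r * r
    # F with (1, n - 2) degrees of freedom for a single predictor
    f_stat = r_squared / (1.0 - r_squared) * (n - 2)
    return slope, intercept, r_squared, f_stat


# General Science, T1 regressed on T3 (values from the printout above).
slope, intercept, r2, f_stat = regression_from_summary(
    mean_y=0.0240, sd_y=0.8570,
    mean_x=0.0025, sd_x=0.9232,
    r=0.9608, n=988,
)
print(f"T1 = {slope:.4f} T3 + {intercept:.6f}")
print(f"R-squared = {r2:.4f}, F(1,{988 - 2}) = {f_stat:.1f}")
```

Because the inputs are rounded to four decimals, the recomputed values agree with the printout only approximately; the F statistic in particular is sensitive to rounding in the correlation.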


GENERAL  SCIENCE 


T1 = ADAPTIVE THETA (C1, U3), 10-items
T2 = ADAPTIVE THETA (C2, U3), 10-items
T3 = NON-ADAPTIVE THETA (C3, U3), 40-items

Analysis for 988 points of 2 variables:

Variable         T2         T3
Min         -2.5830    -3.0300
Max          2.5110     2.5710
Sum         66.6930     2.4860
Mean         0.0675     0.0025
SD           0.8631     0.9232

Correlation Matrix:

T2           1.0000
T3           0.9552     1.0000
Variable         T2         T3

Regression Equation for T2:

T2 = 0.8931 T3 + 0.0652559

Significance test for prediction of T2
Mult-R    R-Squared    F(1,986)      prob(F)
0.9552    0.9124       10271.9555    0.0000


ARITHMETIC  REASONING 


T1 = ADAPTIVE THETA (C1, U3), 10-items
T2 = ADAPTIVE THETA (C2, U3), 10-items
T3 = NON-ADAPTIVE THETA (C3, U3), 40-items

Analysis for 988 points of 2 variables:

Variable         T2         T1
Min         -2.5440    -2.7190
Max          2.5850     2.5890
Sum        -66.1150   -25.5640
Mean        -0.0669    -0.0259
SD           0.9463     0.9266

Correlation Matrix:

T2           1.0000
T1           0.9813     1.0000
Variable         T2         T1


[Scatter plot: T2 plotted against T1]


ARITHMETIC REASONING


T1 = ADAPTIVE THETA (C1, U3), 10-items
T2 = ADAPTIVE THETA (C2, U3), 10-items
T3 = NON-ADAPTIVE THETA (C3, U3), 40-items

Analysis for 988 points of 2 variables:

Variable         T1         T3
Min         -2.7190    -2.5370
Max          2.5890     2.9080
Sum        -25.5640     4.1980
Mean        -0.0259     0.0042
SD           0.9266     0.9449

Correlation Matrix:

T1           1.0000
T3           0.9586     1.0000
Variable         T1         T3

Regression Equation for T1:

T1 = 0.94 T3 + -0.0298686

Significance test for prediction of T1
Mult-R    R-Squared    F(1,986)      prob(F)
0.9586    0.9189       11174.2661    0.0000


ARITHMETIC  REASONING 


T1 = ADAPTIVE THETA (C1, U3), 10-items
T2 = ADAPTIVE THETA (C2, U3), 10-items
T3 = NON-ADAPTIVE THETA (C3, U3), 40-items

Analysis for 988 points of 2 variables:

Variable         T2         T3
Min         -2.5440    -2.5370
Max          2.5850     2.9080
Sum        -66.1150     4.1980
Mean        -0.0669     0.0042
SD           0.9463     0.9449

Correlation Matrix:

T2           1.0000
T3           0.9587     1.0000
Variable         T2         T3

Regression Equation for T2:

T2 = 0.9602 T3 + -0.070998

Significance test for prediction of T2
Mult-R    R-Squared    F(1,986)      prob(F)
0.9587    0.9192       11216.0675    0.0000


[Scatter plot: T2 plotted against T3; axis scaled from -3 to 3]


WORD  KNOWLEDGE 


T1 = ADAPTIVE THETA (C1, U3), 10-items
T2 = ADAPTIVE THETA (C2, U3), 10-items
T3 = NON-ADAPTIVE THETA (C3, U3), 40-items

Analysis for 986 points of 2 variables:

Variable         T2         T1
Min         -2.2940    -2.4290
Max          2.6370     2.5080
Sum         33.4100    11.9230
Mean         0.0339     0.0121
SD           0.8531     0.8767

Correlation Matrix:

T2           1.0000
T1           0.9798     1.0000
Variable         T2         T1


[Scatter plot: T2 plotted against T1; axis scaled from -3 to 3]


WORD  KNOWLEDGE 


T1 = ADAPTIVE THETA (C1, U3), 10-items
T2 = ADAPTIVE THETA (C2, U3), 10-items
T3 = NON-ADAPTIVE THETA (C3, U3), 40-items

Analysis for 986 points of 2 variables:

Variable         T1         T3
Min         -2.4290    -2.5000
Max          2.5080     2.9960
Sum         11.9230     5.2260
Mean         0.0121     0.0053
SD           0.8767     0.8920

Correlation Matrix:

T1           1.0000
T3           0.9665     1.0000
Variable         T1         T3

Regression Equation for T1:

T1 = 0.95 T3 + 0.00705733

Significance test for prediction of T1
Mult-R    R-Squared    F(1,984)      prob(F)
0.9665    0.9341       13958.1727    0.0000


[Scatter plot: T1 plotted against T3; axis scaled from -3 to 3]


WORD  KNOWLEDGE 


T1 = ADAPTIVE THETA (C1, U3), 10-items
T2 = ADAPTIVE THETA (C2, U3), 10-items
T3 = NON-ADAPTIVE THETA (C3, U3), 40-items

Analysis for 986 points of 2 variables:

Variable         T2         T3
Min         -2.2940    -2.5000
Max          2.6370     2.9960
Sum         33.4100     5.2260
Mean         0.0339     0.0053
SD           0.8531     0.8920

Correlation Matrix:

T2           1.0000
T3           0.9632     1.0000
Variable         T2         T3

Regression Equation for T2:

T2 = 0.9211 T3 + 0.0290022

Significance test for prediction of T2
Mult-R    R-Squared    F(1,984)      prob(F)
0.9632    0.9277       12624.8208    0.0000


[Scatter plot: T2 plotted against T3; axis scaled from -3 to 3]


SHOP  INFORMATION 


T1 = ADAPTIVE THETA (C1, U3), 10-items
T2 = ADAPTIVE THETA (C2, U3), 10-items
T3 = NON-ADAPTIVE THETA (C3, U3), 40-items

Analysis for 986 points of 2 variables:

Variable         T2         T1
Min         -2.7960    -2.6640
Max          2.7030     2.6750
Sum         12.1180    40.9270
Mean         0.0123     0.0415
SD           0.8962     0.8656

Correlation Matrix:

T2           1.0000
T1           0.9564     1.0000
Variable         T2         T1


[Scatter plot: T2 plotted against T1; axis scaled from -3 to 3]


SHOP  INFORMATION 


T1 = ADAPTIVE THETA (C1, U3), 10-items
T2 = ADAPTIVE THETA (C2, U3), 10-items
T3 = NON-ADAPTIVE THETA (C3, U3), 40-items

Analysis for 986 points of 2 variables:

Variable         T1         T3
Min         -2.6640    -3.6840
Max          2.6750     2.5460
Sum         40.9270   -10.4760
Mean         0.0415    -0.0106
SD           0.8656     0.9146

Correlation Matrix:

T1           1.0000
T3           0.9508     1.0000
Variable         T1         T3

Regression Equation for T1:

T1 = 0.8999 T3 + 0.0510692

Significance test for prediction of T1
Mult-R    R-Squared    F(1,984)      prob(F)
0.9508    0.9041       9273.3592     0.0000


SHOP  INFORMATION 


T1 = ADAPTIVE THETA (C1, U3), 10-items
T2 = ADAPTIVE THETA (C2, U3), 10-items
T3 = NON-ADAPTIVE THETA (C3, U3), 40-items

Analysis for 986 points of 2 variables:

Variable         T2         T3
Min         -2.7960    -3.6840
Max          2.7030     2.5460
Sum         12.1180   -10.4760
Mean         0.0123    -0.0106
SD           0.8962     0.9146

Correlation Matrix:

T2           1.0000
T3           0.9507     1.0000
Variable         T2         T3

Regression Equation for T2:

T2 = 0.9316 T3 + 0.0221877

Significance test for prediction of T2
Mult-R    R-Squared    F(1,984)      prob(F)
0.9507    0.9039       9254.1786     0.0000


[Scatter plot: T2 plotted against T3]
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